Distributed ml on k8s - first tutorial #280

Merged

matbun merged 21 commits into main from dist-ml-k8s

Jan 9, 2025

Collaborator

matbun commented Jan 7, 2025

This is somewhat related to #172, but it introduces distributed ML on k8s using Kubeflow's training operator, rather than focusing on pipelines or ML tracking.

Other changes:

Fixed some problems in pyproject.toml, which should now be a bit more stable (updated the uv lock accordingly).
Updated the TorchTrainer and TorchDistributedStrategy to support the gloo backend for CPU-only distributed training.
Added support for tqdm progress bars in the trainer (can be disabled by the user). Not a fan of tqdm, but it is useful when running in k8s pods.
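
For readers unfamiliar with the gloo change, here is a minimal sketch of how a backend could be chosen for CPU-only runs. It is not the code from this PR: `pick_backend` and `process_group_kwargs` are hypothetical helper names, and the only assumptions are that the job is launched with torchrun (which sets `RANK` and `WORLD_SIZE`) and that the resulting values are later passed to `torch.distributed.init_process_group`.

```python
import os
from typing import Mapping, Optional


def pick_backend(cuda_available: bool) -> str:
    """Return "nccl" when GPUs are available, otherwise "gloo" (CPU-only)."""
    return "nccl" if cuda_available else "gloo"


def process_group_kwargs(
    cuda_available: bool, env: Optional[Mapping[str, str]] = None
) -> dict:
    """Collect arguments for torch.distributed.init_process_group.

    Hypothetical helper: RANK and WORLD_SIZE are read from the environment,
    where torchrun places them. The defaults describe a single-process run.
    """
    env = os.environ if env is None else env
    return {
        "backend": pick_backend(cuda_available),
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    # On a CPU-only k8s pod this selects the gloo backend.
    print(process_group_kwargs(cuda_available=False))
```

In a real training script the returned dictionary could be unpacked as `torch.distributed.init_process_group(**kwargs)`, with `cuda_available` taken from `torch.cuda.is_available()`.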

Future work includes distributed ML on GPU nodes and k8s persistent volumes.

matbun requested review from jarlsondre and annaelisalappe

January 7, 2025 14:09

jarlsondre reviewed

docs/getting-started/slurm.rst (outdated, resolved)

jarlsondre reviewed

docs/tutorials/tutorials.rst (resolved)

jarlsondre reviewed

src/itwinai/torch/distributed.py (resolved)

jarlsondre reviewed

src/itwinai/torch/distributed.py (outdated, resolved)

jarlsondre reviewed

src/itwinai/torch/distributed.py (outdated, resolved)

jarlsondre reviewed

src/itwinai/torch/trainer.py (resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/README.md (outdated, resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/README.md (outdated, resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py (resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py (resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py (outdated, resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py (outdated, resolved)

annaelisalappe reviewed

tutorials/distributed-ml/torch-kubeflow-1/README.md (outdated, resolved)

tutorials/distributed-ml/torch-kubeflow-1/README.md (outdated, resolved)

tutorials/distributed-ml/torch-kubeflow-1/README.md (resolved)

jarlsondre reviewed

tutorials/distributed-ml/torch-kubeflow-1/train-cpu.py (resolved)

matbun requested review from annaelisalappe and jarlsondre

January 8, 2025 09:43

matbun added 12 commits

January 9, 2025 15:58


First pod poc (3de3fb9)
Update with torchrun (a523026)
Update README (b221842)
First working version of CPU-only distrib ML (6cf7571)
Cleanup Dockerfile (71b0fde)
fix linter
Small fixes (3163acb)
Add Kubeflow tutorial to list (aca43db)
Polished tutorial (fe8f011)
Fix note (b44cc16)
cleanup spaces (e7afef4)
improve text (230bb30)

matbun and others added 9 commits

January 9, 2025 15:58


improve text (c747a6b)
Update distributed.py (896c750)
Update README.md (e6b35cc)
Update README.md (e08bd9f)
Update README.md (0c8581b)
Update train-cpu.py (77f8eca)
Update distributed.py (270474d)
fix linter (c275a8c)
clarify salloc (07f6160)

matbun force-pushed the dist-ml-k8s branch from 72f18bc to 07f6160

January 9, 2025 14:58

jarlsondre approved these changes

Collaborator

jarlsondre left a comment

LGTM

matbun merged commit 3fc10ef into main

13 checks passed

matbun deleted the dist-ml-k8s branch

January 9, 2025 15:27
